Hybrid Algorithm for Data Compression Using Genetic and 
Huffman Algorithm 

Abstract 

Text compression plays an important role: it is essential for
decreasing storage size and increasing the speed of data
transmission through communication channels.

In this research, a hybrid compression system is introduced. It
uses a genetic algorithm to find the Huffman tree that gives the best
compression ratio, then applies a second compression method, named
Oring bits, to the output of the primary compression. Decompression
algorithms for both Huffman and Oring bits are also used.

The variety of characters in the text leads to a variety of
Huffman trees, from which the best possible compression is obtained.
In addition, an increase in the frequencies of the characters in the
text has an effect on the compression rate.

Structure of proposed system 

Introduction 

In this section, Figure (1) illustrates the block diagram of the
proposed system and its algorithms. The compression mechanism of the
system is reviewed through examples that show how the trees and
codes are built and applied.

Creating the Individual & the Initial Population

First, the text to be compressed, for which the best Huffman
tree is sought, is input as a Notepad text file. The proposed system
reads the text file, and the initial population is created from the
text: one individual (chromosome) is produced, and the other
individuals of the population are derived from it randomly.
For example: 

Text: abccddeeee 



Produces the individual (chromosome) 






And from this individual the initial population is produced randomly 
as follows: 

The chromosome (individual) length depends on the number of
distinct characters in the input text. The frequency of each character
is computed, and then the probability of each character is found.
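
As a rough illustration of this step, the following Python sketch computes the character probabilities and builds a population by shuffling a first chromosome. The function names, the choice of the sorted distinct characters as the first individual, and the random seed are assumptions made for illustration; the source does not specify them.

```python
import random
from collections import Counter


def gene_probabilities(text):
    """Return {character: probability} for the distinct characters of text."""
    freq = Counter(text)
    total = len(text)
    return {ch: count / total for ch, count in freq.items()}


def initial_population(text, size, seed=None):
    """Build `size` chromosomes over the distinct characters of text.

    The first individual is the sorted distinct characters (an assumption);
    the others are random permutations of it.
    """
    rng = random.Random(seed)
    first = sorted(set(text))
    population = ["".join(first)]
    while len(population) < size:
        genes = first[:]
        rng.shuffle(genes)
        population.append("".join(genes))
    return population


probs = gene_probabilities("abccddeeee")
population = initial_population("abccddeeee", size=15, seed=1)
```

Every chromosome is a permutation of the same five genes, so the chromosome length equals the number of distinct characters in the text.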



Building Huffman Tree 

After the initial population is generated, a Huffman tree is
built for each individual in the population according to the
probabilities of its genes. The codeword of each gene is then read
from the tree, as in the example of Figure (2):

Figure (2) Huffman Codes



where:

a=000 , b=001 , c=01 , d=10 , e=11
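
A minimal Python sketch of building Huffman codewords for one chromosome is given below. The source does not state how the gene order shapes the tree; here ties between equal weights are broken by gene position in the chromosome, which is an assumption, as is the function name `huffman_codes`. The bit patterns it produces may differ from those of Figure (2) even when the codeword lengths agree.

```python
import heapq
from collections import Counter


def huffman_codes(text, chromosome):
    """Build Huffman codewords for the genes of `chromosome`.

    Ties between equal weights are broken by gene position in the
    chromosome (an assumption; the source does not give the rule).
    """
    freq = Counter(text)
    # Heap entries: (weight, tie_break, {gene: partial codeword}).
    heap = [(freq[g], rank, {g: ""}) for rank, g in enumerate(chromosome)]
    heapq.heapify(heap)
    next_rank = len(chromosome)
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        # Prefix '0' to every codeword in the first subtree, '1' in the second.
        merged = {g: "0" + code for g, code in left.items()}
        merged.update({g: "1" + code for g, code in right.items()})
        heapq.heappush(heap, (w0 + w1, next_rank, merged))
        next_rank += 1
    return heap[0][2]


codes = huffman_codes("abccddeeee", "abcde")
encoded = "".join(codes[ch] for ch in "abccddeeee")
```

For the text abccddeeee this yields three-bit codewords for a and b and two-bit codewords for c, d and e, so the ten characters are encoded in 22 bits.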
Evaluation 

Every individual is evaluated by computing its fitness
function. The fitness function here is the variance of the codeword
lengths, which first requires the average codeword size of the
individual.

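$$A = \sum_{i=1}^{n} P_i \, a_i \qquad (1)$$

$$\text{Variance} = \sum_{i=1}^{n} P_i \, (a_i - A)^2 \qquad (2)$$

where,

P_i: probability of gene i
a_i: number of bits (codeword length) of gene i
A: average size
n: number of genes in the chromosome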

For example: 

The average size for the previous example shown in Figure (2) is 
computed as follows: 

0.4*2 + 0.2*2 + 0.2*2 + 0.1*3 + 0.1*3 = 2.2 

Then the variance is computed as follows: 
$$0.4(2-2.2)^2 + 0.2(2-2.2)^2 + 0.2(2-2.2)^2 + 0.1(3-2.2)^2 + 0.1(3-2.2)^2 = 0.160$$
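
The worked example above can be checked with a short Python sketch. The function names are illustrative assumptions; the probabilities and codeword lengths are those of the abccddeeee example and the codes of Figure (2).

```python
def average_size(probs, lengths):
    """Average codeword size A: sum of P_i * a_i over all genes."""
    return sum(probs[g] * lengths[g] for g in probs)


def variance(probs, lengths):
    """Fitness: sum of P_i * (a_i - A)^2 over all genes."""
    avg = average_size(probs, lengths)
    return sum(probs[g] * (lengths[g] - avg) ** 2 for g in probs)


# Probabilities of the text "abccddeeee" and codeword lengths of Figure (2).
probs = {"a": 0.1, "b": 0.1, "c": 0.2, "d": 0.2, "e": 0.4}
lengths = {"a": 3, "b": 3, "c": 2, "d": 2, "e": 2}

avg = average_size(probs, lengths)
var = variance(probs, lengths)
```

Up to floating-point rounding, `avg` is 2.2 and `var` is 0.160, matching the hand calculation.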

Tournament Selection

After evaluation, the two parents that will take part in the
crossover operation are selected by tournament. Two individuals are
chosen at random to form a subpopulation, and the one with the
smaller variance is selected as the first parent. The operation is
repeated in the same way to obtain the second parent.

Figure (3) illustrates an example of binary tournament
selection. The population consists of a set of chromosomes whose
genes are drawn from the English alphabet.

Figure (3) Binary Tournament Selection Example
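
Binary tournament selection can be sketched in Python as follows. The function names are hypothetical, and drawing the two contestants without repetition is an assumption; the variances used in the example are the three values listed in Table (1).

```python
import random


def binary_tournament(population, fitness, rng=random):
    """Draw two distinct individuals at random and return the one
    with the smaller variance (the better fitness)."""
    first, second = rng.sample(population, 2)
    return first if fitness[first] <= fitness[second] else second


def select_parents(population, fitness, rng=random):
    """Run the tournament twice to obtain the two parents."""
    return (binary_tournament(population, fitness, rng),
            binary_tournament(population, fitness, rng))


# Variances taken from Table (1).
fitness = {"abcde": 1.360, "decba": 0.960, "ebdca": 0.160}
population = list(fitness)
parent1, parent2 = select_parents(population, fitness, random.Random(0))
```

With this small population the worst individual (abcde) can never win a tournament, because any opponent it meets has a smaller variance.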



Crossover Operation 

In the proposed system, cycle crossover (CX) was chosen from
among the permutation crossover operators because it suits the
problem. This type of crossover produces a variety of individuals
and, in addition, avoids repeated genes within a chromosome
(individual), which is the most important property required by the
proposed system.
The example in Figure (4) illustrates this:

[Figure (4): Parent 1 and Parent 2 with their cycle elements marked]

Offspring 1: aceihdjbfg
Offspring 2: ihbdgfejac

Figure (4) Crossover Operation
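
The following Python sketch shows the standard form of cycle crossover, in which the cycles are copied alternately from each parent. It is offered as an illustration under that assumption; the parents used here are made up for the example and are not the ones in Figure (4).

```python
def cycle_crossover(parent1, parent2):
    """Cycle crossover (CX) for two permutations of the same genes.

    Positions are split into cycles; odd-numbered cycles are copied
    straight from the parents, even-numbered cycles are swapped.
    """
    n = len(parent1)
    pos_in_p1 = {gene: i for i, gene in enumerate(parent1)}
    child1, child2 = [None] * n, [None] * n
    visited = [False] * n
    keep = True  # True: child1 takes from parent1 for this cycle
    for start in range(n):
        if visited[start]:
            continue
        i = start
        while not visited[i]:
            visited[i] = True
            if keep:
                child1[i], child2[i] = parent1[i], parent2[i]
            else:
                child1[i], child2[i] = parent2[i], parent1[i]
            # Follow the cycle: find where parent2's gene sits in parent1.
            i = pos_in_p1[parent2[i]]
        keep = not keep
    return "".join(child1), "".join(child2)


child1, child2 = cycle_crossover("abcde", "bcaed")
```

Because every position receives the gene that one of the parents holds at that same position, and whole cycles are moved together, both children remain valid permutations with no repeated gene.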

Mutation

Whether mutation occurs is determined from the computed
probabilities. The mutation probability (Pm) is 0.0009; if a gene's
probability is less than or equal to Pm, mutation occurs at that
gene. When mutation occurs, the mutated gene exchanges its location
with the succeeding gene. If the mutation occurs at the last gene,
that gene exchanges its location with the first gene. Figure (5)
illustrates the mutation operation:



Figure (5) Mutation Operation

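
The swap-with-successor mutation can be sketched in Python as below. Interpreting "the gene's probability" as a uniform random number drawn for each gene is an assumption, as are the function names; the wrap-around from the last gene to the first follows the description in the text.

```python
import random


def swap_with_next(chromosome, i):
    """Exchange the gene at position i with the succeeding gene;
    the last gene is exchanged with the first."""
    genes = list(chromosome)
    j = (i + 1) % len(genes)
    genes[i], genes[j] = genes[j], genes[i]
    return "".join(genes)


def mutate(chromosome, pm=0.0009, rng=random):
    """Draw one random number per gene (an assumption) and mutate the
    gene when the draw is less than or equal to pm."""
    for i in range(len(chromosome)):
        if rng.random() <= pm:
            chromosome = swap_with_next(chromosome, i)
    return chromosome


wrapped = swap_with_next("abcde", 4)  # last gene swaps with the first
```

Since mutation only exchanges two positions, the result is always a permutation of the same genes.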
4.7 Evaluate Offspring and Replacement 

The new offspring produced by the crossover operation are
evaluated: a Huffman tree is built for each new individual and its
fitness function is computed.

The new offspring are then compared with the worst individuals
in the population (those with the largest variance). An offspring
replaces a worst individual if it is better than or equal to it, in
order to keep the best variety in the population.

Otherwise, no exchange takes place. After the evaluation
operation is finished, a new selection and crossover operation is
carried out, and the process continues in this way.
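
A minimal Python sketch of this replacement rule follows. Storing the population and the variances as two parallel lists, and the function name `replace_worst`, are assumptions made for illustration; the variances in the example are those of Table (1).

```python
def replace_worst(population, fitness, offspring, offspring_fitness):
    """Replace the individual with the largest variance when the
    offspring is better than or equal to it. Returns True if replaced.

    `population` and `fitness` are parallel lists and are updated in place.
    """
    worst = max(range(len(population)), key=lambda i: fitness[i])
    if offspring_fitness <= fitness[worst]:
        population[worst] = offspring
        fitness[worst] = offspring_fitness
        return True
    return False


population = ["abcde", "decba"]
fitness = [1.360, 0.960]
replaced = replace_worst(population, fitness, "ebdca", 0.160)
```

Here the offspring ebdca (variance 0.160) takes the place of abcde (variance 1.360); an offspring with a variance larger than the current worst would leave the population unchanged.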
Stop Criteria 

The cycle of crossover and generation of a new population is
repeated for a number of generations (possibly thousands).

After this, one of the individuals with the best fitness (the
least variance) is considered to have become prevalent in the
population. The prevalence rate may reach 50% or more, and this rate
is one of the stop criteria; the other stop criterion is the number
of cycles (generations), which can reach 1000 generations.

As a result, the best individual is the most prevalent one in the
last population.
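
The two stop criteria can be combined as in the Python sketch below. Measuring prevalence as the share of identical chromosomes in the population is an assumption, as is the function name; the 50% threshold and the limit of 1000 generations are the values given in the text.

```python
from collections import Counter


def should_stop(population, generation, prevalence=0.5, max_generations=1000):
    """Stop when one chromosome fills at least `prevalence` of the
    population, or when the generation limit is reached."""
    most_common_count = Counter(population).most_common(1)[0][1]
    share = most_common_count / len(population)
    return share >= prevalence or generation >= max_generations


# 8 of 15 individuals are identical, so the prevalence criterion is met.
converged = should_stop(["ebdca"] * 8 + ["abcde"] * 7, generation=10)
```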



Decompression Process 

Result Analysis

The number of generations is 200 and the number of individuals is 15.
The input text is:

abccddeeeeabccddeeeeabccddeeeeabccddeeee

Table (1) Allcode of Example 1. 



Individual    Allcode    Variance
abcde         14         1.360
decba         13         0.960
ebdca         12         0.160


where the allcode represents the number of bits that the Huffman 
tree consists of. 
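
The relation between allcode and variance can be illustrated with a short Python sketch. The codeword lengths below are not given in the source: they are one assignment, chosen for illustration, that reproduces the allcode and variance values of Table (1) for the gene probabilities of this text, with allcode taken as the sum of the codeword lengths. The function names are also assumptions.

```python
# Gene probabilities of the text (a and b once, c and d twice, e four times
# in every ten characters).
PROBS = {"a": 0.1, "b": 0.1, "c": 0.2, "d": 0.2, "e": 0.4}


def allcode(lengths):
    """Total number of bits over all codewords of the tree."""
    return sum(lengths.values())


def variance(lengths):
    """Variance of the codeword lengths weighted by gene probability."""
    avg = sum(PROBS[g] * lengths[g] for g in PROBS)
    return sum(PROBS[g] * (lengths[g] - avg) ** 2 for g in PROBS)


# Hypothetical codeword lengths, one set per row of Table (1).
candidates = {
    "abcde": {"a": 4, "b": 4, "c": 3, "d": 2, "e": 1},
    "decba": {"a": 3, "b": 3, "c": 3, "d": 3, "e": 1},
    "ebdca": {"a": 3, "b": 3, "c": 2, "d": 2, "e": 2},
}

results = {ind: (allcode(l), round(variance(l), 3)) for ind, l in candidates.items()}
```

All three assignments give the same average size of 2.2 bits per character, so under this reading the three trees compress the text equally well and differ only in how evenly the codeword lengths are spread.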

The variances obtained after 300 generations are given in Table (2):

Table (2) Allcode of Example 2.





Individual        Allcode    Variance
teydorclmnqwba    54         0.213
tyeqorclabnwmd    55         0.367
abcdeqwrtynmlo    56         0.521
endmylabwcrotq    57         0.675
mcrloaytdwqnbe    57         0.828
elnrtwbmaoqcdy    58         0.982



As can be seen from Table (2), the same allcode value can give
two different variance values. In this case, the structure of the tree
affects the codeword lengths and therefore the variance.

Table (3) illustrates a set of examples taking into consideration 
different file sizes, while the chromosome length, population size, and 
generation number are fixed. 

Table (3) Effect of Different File Size 





No.  File name  File size  Chrom. Length  Population size  Generation No.  Compression Ratio
1    File 1     150 B      5              15               200             63.333%
2    File 2     300 B      5              15               200             68.000%
3    File 3     600 B      5              15               200             70.333%
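
The source does not define the compression ratio. One common reading, assumed in the Python sketch below, is the percentage of space saved relative to the original file. The compressed size of 55 bytes is a hypothetical value, chosen only because under this reading it reproduces the 63.333% reported for File 1; the source does not list compressed sizes.

```python
def compression_ratio(original_bytes, compressed_bytes):
    """Space saved, as a percentage of the original size
    (an assumed definition, not stated in the source)."""
    return (1 - compressed_bytes / original_bytes) * 100


# 150 B is the size of File 1; 55 B is a hypothetical compressed size.
ratio = compression_ratio(150, 55)
```

Under this definition a larger percentage means a smaller compressed file, which matches the way the tables treat higher ratios as better results.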

Table (4) illustrates a set of examples taking into consideration
different chromosome lengths, while the other parameters are fixed.



Table (4) Effect of Different Chromosome Length 





No.  File name  File size  Chrom. Length  Population size  Generation No.  Compression Ratio
1    File 3     600 B      5              25               300             70.333%
2    File 6     600 B      10             25               300             56.500%
3    File 11    600 B      20             25               300             40.333%



Table (5) illustrates the effect of the probability of genes on the
compression ratio. An increase in the probability of a gene gives
that gene a shorter codeword and, as a result, increases the
compression ratio.



Table (5) Effect of Probability of Gene 





No.  File name  File size  Chrom. Length  Population size  Generation No.  Compression Ratio
1    File 1     150 B      5              15               200             63.333%
2    File 12    100 B      5              15               200             71.000%
3    File 10    90 B       5              15               200             71.111%



The Oring bits algorithm was then applied to some of the above
files, taking into consideration that the ratio of zeros in a file
must be above 65% in order to get good results.

The files that satisfied this condition and gave good
results are shown in Table (6).






Table (6) Results of Oring Files 




File size    Chrom. Length    Compression ratio without    Compression ratio with

77.320 %
79.186 %
84.615 %



The proposed system was also applied to segments of a DNA file
of different sizes; the results are shown in Table (7).

Table (7) Results of DNA Files









DNA 1    72 B     59.722%
DNA 2    153 B    67.320%
DNA 3    300 B    71.333%
DNA 4    542 B    72.878%



It was noticed that there was little variety among the
individuals because of the small number of genes (only four). It was
also noticed that it was difficult to get good results from applying
Oring to the DNA file because there were no long runs of a single
gene.

Conclusions 

1. The subpopulation size plays an effective role in reaching the
optimal solution; if the size of the subpopulation is more than 3,
this leads to premature convergence to a solution that may
not be the optimal one.

2. The same allcode value can give two different variance values. In
this case, the structure of the tree affects the codeword
lengths and therefore the variance.

3. The number of individuals (population size) affects obtaining
the best tree: a small population size leads to little variety
and, as a result, to few crossover operations between the
individuals because they share the same first gene. As a result,
the best individual (tree) will not be obtained.

4. The number of generations affects reaching the best Huffman
tree: continuing to select various individuals gives a larger
chance of selecting the best possible individual.

5. In order to get the best compression from the Oring method
when merging the Huffman compression method with Oring,
the character with the largest probability takes the value '0'
instead of the value '1' in building the Huffman tree.

6. The chromosome length affects the compression ratio: the
longer the chromosome, the smaller the compression ratio. This
happens because the increase in the size of the Huffman tree
increases the number of bytes transmitted through the
transmission channel.

Future Works 

1- Applying a meta-genetic algorithm to the genetic algorithm
(optimization of the GA).

A meta-genetic algorithm has been used to optimize the 
genetic algorithm for cell placement. The three parameters 
optimized are the crossover rate, inversion rate and mutation 
rate. The meta-genetic algorithm is itself a genetic optimization 
process which runs the genetic algorithm to solve a placement 
problem and manipulates the genetic parameters to optimize 
the fitness of the genetic algorithm. The individuals in the 
population of the meta-genetic algorithm consist of three integers
in the range [0,20], representing the mutation rate, inversion 
rate, and crossover rate for the genetic algorithm. 

2- Using the Breeder Genetic Algorithm (BGA):

BGAs represent a class of random optimization
techniques gleaned from the science of population genetics,
which have proved their ability to solve hard optimization
problems with continuous parameters. A BGA, which can be seen
as a recombination of Evolution Strategies (ES) and the
Genetic Algorithm (GA), uses truncation selection, which is
very similar to the (μ, λ) strategy in ESs, and the search process
is mainly driven by recombination, making BGAs very similar
to GAs. It has been shown that BGAs can solve problems more
efficiently than GAs because of their theoretically faster
convergence to the optimum, and they can, like GAs, easily be
written in a parallel form.

3- Applying a further compression algorithm (the Delta algorithm) to
the research results in order to decrease the cost of communication
channels (reducing bandwidth).

After the compression operation is finished and the sequence of
binary numbers has been converted to integer numbers, and before
it is saved as bytes, the delta algorithm is applied in order to
reduce the values of the integer numbers. This in turn reduces
the amplitude of the transmitted signal and gives a smaller
bandwidth.
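
The source names the Delta algorithm without defining it. The Python sketch below assumes plain delta encoding, in which the first integer is kept and every later integer is stored as its difference from the previous one; the function names and the sample values are illustrative only.

```python
def delta_encode(values):
    """Keep the first value; store each later value as the difference
    from its predecessor."""
    deltas, previous = [], 0
    for value in values:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas):
    """Rebuild the original values by accumulating the differences."""
    values, total = [], 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values


sample = [200, 203, 201, 205]          # hypothetical integer sequence
encoded = delta_encode(sample)         # small differences after the first value
restored = delta_decode(encoded)
```

When neighbouring integers are close to each other, the stored differences are much smaller in magnitude than the original values, which is the reduction the proposal relies on.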

4- The proposed system can be applied to images by dealing with
an image as segments, where the Huffman algorithm is applied to
each segment.